feat: add GiGPO anchor state computation to WAADesktopEnv #122

Merged: abrichr merged 5 commits into main from feat/gigpo-anchor-state on Mar 17, 2026

abrichr (Member) commented on Mar 17, 2026

Summary

  • Add compute_anchor_state() function using a11y tree SHA256 hash (primary) with screenshot MD5 fallback
  • Include state_key in info dict from both reset() and step() for efficient cross-rollout grouping
  • No new dependencies — uses stdlib hashlib
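The hashing scheme described in the bullets above could look roughly like the following. This is a minimal sketch, not the merged implementation: the signature, the JSON canonicalization step, and the `a11y:` / `img:` prefixes are assumptions made for illustration. Only the SHA256-primary, MD5-fallback behavior and the use of `hashlib` come from the PR description.

```python
import hashlib
import json
from typing import Any, Optional


def compute_anchor_state(
    a11y_tree: Optional[Any] = None,
    screenshot: Optional[bytes] = None,
) -> Optional[str]:
    """Return a state key: SHA256 of the a11y tree, else MD5 of the screenshot.

    Hypothetical signature; the real function's arguments may differ.
    """
    if a11y_tree:
        # Assumption: serialize with sorted keys so that equal trees hash
        # identically regardless of dict ordering.
        canonical = json.dumps(a11y_tree, sort_keys=True, default=str)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return "a11y:" + digest  # prefix is illustrative only
    if screenshot:
        # Fallback: exact-bytes hash of the raw screenshot.
        return "img:" + hashlib.md5(screenshot).hexdigest()
    # Neither observation is available, so no key can be computed.
    return None
```

Note that an exact-bytes MD5 only matches pixel-identical screenshots, so the fallback is stricter than a perceptual hash would be.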

Context

GiGPO groups identical intermediate states (anchor states) across rollouts to assign per-action advantages. With state_key already present in info, grouping a step reduces to an O(1) dictionary lookup, so VAGEN does not have to recompute perceptual hashes across all rollout steps.
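To illustrate the grouping step, the sketch below buckets steps from several rollouts by their `state_key`. The function name `group_by_state_key` and the rollout layout (a list of per-step `info` dicts per rollout) are hypothetical and not taken from VAGEN, verl, or verl-agent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_state_key(
    rollouts: List[List[dict]],
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each state_key to the (rollout_index, step_index) pairs that share it."""
    groups: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for r_idx, steps in enumerate(rollouts):
        for s_idx, info in enumerate(steps):
            key = info.get("state_key")
            if key is not None:
                # One dictionary insert per step: O(1) on average.
                groups[key].append((r_idx, s_idx))
    return dict(groups)


rollouts = [
    [{"state_key": "a"}, {"state_key": "b"}],
    [{"state_key": "a"}, {"state_key": None}],
]
groups = group_by_state_key(rollouts)
```

Steps that land in the same bucket are the candidates for a shared per-action advantage baseline.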

Test plan

  • All 38 existing verl env tests pass
  • Verify state_key is present in info from reset() and step()
  • Validate against real WAA VM with WAADesktopEnv
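The second item of the test plan could be checked along these lines. `StubEnv` is a hypothetical stand-in so the check runs without a WAA VM, and its return shapes (a 2-tuple from `reset()`, a 4-tuple from `step()`) are assumptions. The helper only relies on `info` being the last element of each return value.

```python
class StubEnv:
    """Hypothetical stand-in for WAADesktopEnv; return shapes are assumed."""

    def __init__(self):
        self._t = 0

    def reset(self):
        self._t = 0
        return {"obs": None}, {"state_key": "a11y:stub-0"}

    def step(self, action):
        self._t += 1
        return {"obs": None}, 0.0, False, {"state_key": f"a11y:stub-{self._t}"}


def check_state_key(env, action):
    """Assert that info from reset() and step() both carry state_key."""
    reset_info = env.reset()[-1]  # info is assumed to be the last element
    step_info = env.step(action)[-1]
    for info in (reset_info, step_info):
        assert isinstance(info, dict) and "state_key" in info
    return reset_info["state_key"], step_info["state_key"]


keys = check_state_key(StubEnv(), "noop")
```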

🤖 Generated with Claude Code

abrichr and others added 5 commits March 16, 2026 22:08
Add compute_anchor_state() function that produces a state key for
GiGPO cross-rollout grouping. Uses a11y tree SHA256 hash (primary)
with screenshot MD5 fallback. The state_key is included in the info
dict from both reset() and step() so VAGEN/verl can use it for
O(1) anchor grouping instead of recomputing perceptual hashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add dated addendum (2026-03-16) correcting the earlier conflation of
VAGEN and verl-agent as a single project. Key findings: VAGEN-Lite
dropped Bi-Level GAE (only vanilla GRPO/PPO), GiGPO lives exclusively
in verl-agent which uses its own env_base.py interface (not GymImageEnv),
and our train_verl_e2e.py targets the wrong entry point. Outlines a
corrected two-phase path: standalone GRPO first, then direct verl-agent
integration if per-step credit is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers desktop RL landscape (30+ projects), per-step credit assignment
alternatives (HCAPO recommended over GiGPO), scaling architectures
(ComputerRL, DART-GUI), and synthetic environment feasibility
(GUI-Genesis). Includes revised architecture recommendation:
standalone GRPO + HCAPO first, then dense rewards + API-GUI hybrid,
then async scaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ng math

HCAPO and per-step credit are Phase 3 optimizations, not Phase 1.
The bottleneck is rollout success rate (getting non-zero rewards),
not loss computation. Dense partial-credit rewards and API-GUI hybrid
actions directly increase gradient signal and should come first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
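The reasoning in the commit message above (rollout success rate, not loss computation, is the bottleneck) can be illustrated with group-normalized advantages in the style of GRPO. This is a simplified sketch: it uses the population standard deviation and a small `eps`, both of which are assumptions, and real implementations may differ in detail.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all advantages are zero, so there is no gradient signal.
all_failed = group_advantages([0.0] * 8)

# A single success among eight rollouts already yields non-zero advantages.
one_success = group_advantages([1.0] + [0.0] * 7)
```

When every reward in a group is identical, the numerator is zero for every rollout, so a per-step credit scheme has nothing to redistribute until some rollouts earn reward.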
@abrichr abrichr merged commit 72aa537 into main Mar 17, 2026
1 check passed